feat: add array_normalize scalar function #22013

Merged
Jefffrey merged 1 commit into apache:main from crm26:feat/array-normalize
May 20, 2026

@crm26 crm26 commented May 5, 2026

Which issue does this PR close?

Part of #21536 — split of #21371 into one-function-per-PR. Third in the series after #21542 (cosine_distance) and #21861 (inner_product).

Rationale for this change

Adds array_normalize(array) — the L2-normalized version of a numeric input vector. Computed as array[i] / sqrt(sum(array[i]^2)) per element. Returns the same shape as the input (List<Float64> or LargeList<Float64>).

Aliased as list_normalize to match the array_X/list_X convention used across the crate.

What changes are included in this PR?

Coercion shell mirrors the merged cosine_distance/inner_product pattern:

  • coerce_types accepts List/LargeList/FixedSizeList of any numeric inner type, plus bare NULL. After coercion the inner function only sees List(Float64) or LargeList(Float64).
  • Per-row L2 norm computed inline (no shared module), using a single as_float64_array(list_array.values()) downcast plus value_offsets() slicing — no per-row downcasts.
  • Manual list builder: Vec<f64> for values, Vec<O> for offsets, NullBuffer for row validity.
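To make the `value_offsets()` slicing concrete, here is a minimal std-only sketch. It assumes the list column is modeled as one flat `values` slice plus an `offsets` slice, where row `i` spans `offsets[i]..offsets[i + 1]`. The function name `row_l2_norms` is hypothetical and this is not the code from the PR; it also ignores nulls, which the per-row semantics handle separately.

```rust
/// Computes the L2 norm of every row of a list column laid out as a flat
/// values buffer plus an offsets buffer. Hypothetical helper for illustration.
fn row_l2_norms(values: &[f64], offsets: &[i32]) -> Vec<f64> {
    offsets
        .windows(2) // each adjacent pair of offsets delimits one row
        .map(|w| {
            // Slice the shared values buffer once per row; no per-row downcast.
            let row = &values[w[0] as usize..w[1] as usize];
            row.iter().map(|v| v * v).sum::<f64>().sqrt()
        })
        .collect()
}
```

With `values = [3.0, 4.0, 1.0, 2.0, 2.0]` and `offsets = [0, 2, 2, 5]`, the three rows are `[3, 4]`, `[]`, and `[1, 2, 2]`.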

Per-row semantics:

  • NULL row → NULL output
  • NULL element in list → NULL row
  • Empty list → empty list (no division-by-zero hazard)
  • Zero magnitude → NULL row (consistent with cosine_distance's zero-magnitude → NULL)
  • Otherwise → divide each element by sqrt(sum-of-squares)
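The per-row rules above can be sketched on plain Rust types. This is an illustration under assumptions, not the PR's implementation: a row is modeled as `Option<&[Option<f64>]>` (outer `None` is a NULL row, inner `None` is a NULL element), and `normalize_row` is a hypothetical name.

```rust
/// Applies the array_normalize per-row semantics to a single row.
/// Hypothetical model: `None` stands for SQL NULL at either level.
fn normalize_row(row: Option<&[Option<f64>]>) -> Option<Vec<f64>> {
    // NULL row -> NULL output.
    let row = row?;
    // Any NULL element -> NULL row.
    let vals = row.iter().copied().collect::<Option<Vec<f64>>>()?;
    // Empty list -> empty list, checked before the zero-magnitude rule.
    if vals.is_empty() {
        return Some(Vec::new());
    }
    let norm = vals.iter().map(|v| v * v).sum::<f64>().sqrt();
    // Zero magnitude -> NULL row (avoids dividing by zero).
    if norm == 0.0 {
        return None;
    }
    Some(vals.iter().map(|v| v / norm).collect())
}
```

Note the ordering: the empty-list check has to come before the zero-magnitude check, otherwise an empty list (norm 0) would come out as NULL instead of an empty list.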

Are these changes tested?

Yes. SLT covers:

  • 3-4-5 right triangle, 3D vector, already-unit-axis, single non-zero component, negative components
  • Bare NULL input, NULL element in list, zero vector, empty array
  • LargeList, FixedSizeList (via coercion), Float32 and Int64 inner types, integer literals
  • Multi-row query mixing normal / NULL row / zero-vector row / null-element row
  • Plan error for non-list input
  • No-args error
  • Return-type assertion (List(Float64))
  • list_normalize alias coverage (constant + multi-row with NULL)

Are there any user-facing changes?

New scalar function array_normalize (alias list_normalize), documented in docs/source/user-guide/sql/scalar_functions.md.

@github-actions github-actions bot added the documentation, sqllogictest, and functions labels May 5, 2026
let mut new_values: Vec<f64> = Vec::with_capacity(values.len());
let mut new_offsets: Vec<O> = Vec::with_capacity(list_array.len() + 1);
new_offsets.push(O::usize_as(0));
let mut validity: Vec<bool> = Vec::with_capacity(list_array.len());
Contributor

Use NullBufferBuilder here instead. One benefit is when finishing it, it may output None if there are no nulls (currently we always provide a null buffer even if there are no nulls)

Contributor Author

Swapped to NullBufferBuilder: append_null() / append_non_null() per row, and nulls.finish() returns None when no nulls accumulated, so we stop emitting a redundant null buffer on all-valid inputs. Thanks.
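For readers unfamiliar with the "finish returns None" behavior, here is a std-only model of the idea. `ValidityBuilder` is a hypothetical stand-in written for this note, not arrow's `NullBufferBuilder`, and it makes no claim about how the real builder stores its bits.

```rust
/// Hypothetical stand-in for a null-buffer builder (not the arrow type).
/// Tracks per-row validity and yields no buffer when every row is valid.
struct ValidityBuilder {
    bits: Vec<bool>,
    has_null: bool,
}

impl ValidityBuilder {
    fn new() -> Self {
        Self { bits: Vec::new(), has_null: false }
    }

    fn append_non_null(&mut self) {
        self.bits.push(true);
    }

    fn append_null(&mut self) {
        self.bits.push(false);
        self.has_null = true;
    }

    /// Returns `None` when no nulls were appended, so the caller can build
    /// the output array without a validity buffer at all.
    fn finish(self) -> Option<Vec<bool>> {
        if self.has_null { Some(self.bits) } else { None }
    }
}
```

The benefit is on the consumer side: an output with no validity buffer lets downstream code skip null checks entirely.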

let offsets = list_array.value_offsets();

let mut new_values: Vec<f64> = Vec::with_capacity(values.len());
let mut new_offsets: Vec<O> = Vec::with_capacity(list_array.len() + 1);
Contributor

I think it might be simpler to use OffsetBufferBuilder here

Contributor Author

Thanks @Jefffrey — swapped to OffsetBufferBuilder in c3576a30e. Each branch now uses push_length(0) for null/null-element/empty/zero-mag rows and push_length(len) for valid rows; final buffer from new_offsets.finish(). Cleaner than the manual Vec<O> + OffsetBuffer::new(... .into()).
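The `push_length` pattern described here amounts to keeping a running total of row lengths. A std-only model follows; `OffsetsBuilder` is a hypothetical stand-in written for this note, not arrow's `OffsetBufferBuilder`, and it fixes the offset type to `i32` for brevity.

```rust
/// Hypothetical stand-in for an offset-buffer builder (not the arrow type).
/// Offsets always start at 0; each pushed length appends `last + len`.
struct OffsetsBuilder {
    offsets: Vec<i32>,
}

impl OffsetsBuilder {
    fn new() -> Self {
        Self { offsets: vec![0] }
    }

    /// Records one row of `len` elements. NULL, null-element, empty, and
    /// zero-magnitude rows would all push a length of 0.
    fn push_length(&mut self, len: usize) {
        let last = *self.offsets.last().expect("offsets always start with 0");
        self.offsets.push(last + len as i32);
    }

    fn finish(self) -> Vec<i32> {
        self.offsets
    }
}
```

Pushing lengths 2, 0, 3 gives offsets `[0, 2, 2, 5]`: the zero-length row still occupies a slot, which keeps offsets aligned with row validity.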

@crm26 crm26 force-pushed the feat/array-normalize branch from 557d221 to c3576a3 Compare May 16, 2026 19:11
@Jefffrey Jefffrey added this pull request to the merge queue May 20, 2026
@Jefffrey
Contributor

Thanks @crm26

Merged via the queue into apache:main with commit 821260f May 20, 2026
36 checks passed
crm26 added a commit to crm26/datafusion that referenced this pull request May 22, 2026
Adds `array_add(array1, array2)` returning the element-wise sum of two
numeric arrays. Aliased as `list_add`. Follows the per-function split
pattern established by cosine_distance (apache#21542), inner_product (apache#21861),
and array_normalize (apache#22013) per tracking issue apache#21536.

Semantics:
- NULL row in either input -> NULL row out
- NULL element at position i in either input -> NULL element at i out
  (per-element propagation, divergent from inner_product which nulls
  the whole row; chosen because output is a list, not a scalar)
- Length mismatch between rows -> exec_err
- Empty arrays -> empty array

Supports List, LargeList, and FixedSizeList inputs; numeric element
types are coerced to Float64. If any input is LargeList, both sides
are widened to LargeList for homogeneous runtime dispatch.

Uses OffsetBufferBuilder + NullBufferBuilder per the pattern adopted
in array_normalize round 1.
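
A rough sketch of these `array_add` semantics on plain Rust types, assuming `None` models SQL NULL at both the row and element level. `add_rows` is a hypothetical name and the `String` error merely stands in for `exec_err`; this is not the code from the commit.

```rust
/// Element-wise sum of two rows with per-element NULL propagation.
/// Hypothetical model of the array_add semantics listed above.
fn add_rows(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<Vec<Option<f64>>>, String> {
    // NULL row in either input -> NULL row out.
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };
    // Length mismatch between rows -> execution error.
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }
    // NULL element at position i in either input -> NULL element at i.
    let sums = a
        .iter()
        .zip(b)
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(x + y),
            _ => None,
        })
        .collect();
    Ok(Some(sums))
}
```

Two empty rows pass the length check and produce an empty row, matching the last bullet.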
zhuqi-lucas pushed a commit to zhuqi-lucas/arrow-datafusion that referenced this pull request May 26, 2026
## Which issue does this PR close?

Partial of apache#21536 — `array_scale` (the list+scalar arithmetic function
in the vector math series).

## Rationale for this change

Continues the per-function split requested by @alamb on apache#21536. Three
sibling PRs already merged: `cosine_distance` (apache#21542), `inner_product`
(apache#21861), `array_normalize` (apache#22013). `array_add` is in flight as apache#22459
by @SubhamSinghal.

Adds element-wise scalar multiplication for numeric arrays, returning a
list of the same shape. Aliased as `list_scale` to match the `array_X` /
`list_X` precedent in this crate.

## What changes are included in this PR?

- New scalar UDF `array_scale(array, scalar)` in
`datafusion/functions-nested/src/array_scale.rs`
- Module wire-up + registration in
`datafusion/functions-nested/src/lib.rs`
- SLT tests at `datafusion/sqllogictest/test_files/array_scale.slt`
- Auto-generated function docs entry in
`docs/source/user-guide/sql/scalar_functions.md`

**Signature:** first arg `List/LargeList/FixedSizeList<numeric>`, second
arg numeric scalar. Both coerce to `Float64`. Same list-widening rules
as the binary-op siblings.

**NULL semantics:**
- NULL row in array → NULL row out
- NULL scalar → NULL row out (whole-row, because the scalar applies
uniformly)
- NULL element at position `i` → NULL element at `i` out
(per-element propagation)
- Empty array → empty array
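
As an illustration of these NULL rules, a minimal sketch on plain Rust types, assuming `None` models SQL NULL. `scale_row` is a hypothetical name, not the function in `array_scale.rs`.

```rust
/// Multiplies every element of a row by a scalar.
/// Hypothetical model of the array_scale NULL semantics listed above.
fn scale_row(row: Option<&[Option<f64>]>, scalar: Option<f64>) -> Option<Vec<Option<f64>>> {
    // NULL row or NULL scalar -> NULL row out.
    let (row, k) = (row?, scalar?);
    // NULL element stays NULL at the same position; an empty row stays empty.
    Some(row.iter().map(|v| v.map(|x| x * k)).collect())
}
```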

**Builders:** uses `OffsetBufferBuilder` + `NullBufferBuilder` per
the pattern adopted in the round-1 review of apache#22013.

## Are these changes tested?

Yes. `array_scale.slt` covers:
- Happy paths (positive, negative, zero, fractional, single-element)
- NULL propagation at all three levels (NULL row, NULL scalar, NULL
element)
- All list type variants (`List`, `LargeList`, `FixedSizeList`)
- Numeric inner type coercion (Float32, Int64, integer literals)
- Multi-row queries with both constant-scalar broadcast and per-row
column scalar
- Error paths (non-numeric scalar, non-list first arg, wrong arity)
- Empty array
- `list_scale` alias

## Are there any user-facing changes?

Yes — new SQL scalar function `array_scale(array, scalar)` and its
alias `list_scale`. Documented in
`docs/source/user-guide/sql/scalar_functions.md`.
crm26 added a commit to crm26/datafusion that referenced this pull request May 26, 2026
Adds `array_sum(array)` returning the sum of elements in a numeric array.
Aliased as `list_sum`. Part of the per-function split sequence on
tracking issue apache#21536, following the pattern of the already-merged PRs
in this series (cosine_distance apache#21542, inner_product apache#21861,
array_normalize apache#22013, array_scale apache#22466).

Semantics:
- NULL row in array -> NULL row out
- NULL elements are skipped (SQL aggregate convention; matches
  PostgreSQL array_sum, DuckDB list_sum, Spark aggregate). A row whose
  every element is NULL yields NULL.
- Empty array -> 0.0 (additive identity, matches SQL SUM over no rows
  conceptually, and DuckDB list_sum([]) = 0)

Input is List/LargeList/FixedSizeList of any numeric type; elements
are coerced to Float64. Output is Float64.
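
The empty-versus-all-NULL distinction in these `array_sum` semantics is easy to get wrong, so here is a small sketch on plain Rust types, assuming `None` models SQL NULL. `sum_row` is a hypothetical name, not the code from the commit.

```rust
/// Sums the non-NULL elements of a row.
/// Hypothetical model of the array_sum semantics listed above.
fn sum_row(row: Option<&[Option<f64>]>) -> Option<f64> {
    // NULL row -> NULL out.
    let row = row?;
    // Empty array -> 0.0 (additive identity).
    if row.is_empty() {
        return Some(0.0);
    }
    // NULL elements are skipped.
    let vals: Vec<f64> = row.iter().flatten().copied().collect();
    // A non-empty row whose every element is NULL yields NULL.
    if vals.is_empty() {
        None
    } else {
        Some(vals.iter().sum())
    }
}
```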